ABSTRACT 

Four problems faced by the staff of the California 
Assessment Program (CAP) were solved by applying Rasch scaling 
techniques: (1) item cultural bias in the Entry Level Test (ELT) 
given to all first grade pupils; (2) nonlinear regression analysis of 
the third grade Reading Test scores; (3) comparison of school growth 
from grades two to three, using the Reading Test; and (4) analysis of 
growth from grades two to three, based on the Reading Test, in the 
areas of word identification, vocabulary, comprehension, and 
study-locational skills. Solution of the problems demonstrated that 
existing Rasch models have practical significance and should be more 
widely used by educators. (MH) 
This paper presents four problems that the staff of the California 
Assessment Program (CAP), the statewide testing program for the state of 
California, found difficult to solve. Each of these problems proved to 
be readily solvable when techniques being developed by advocates of the 
Rasch model were applied. The purpose of this paper is not to present new 
approaches for using the Rasch model, but to demonstrate that the approaches 
that have been developed already have great practical significance and should 
be disseminated and used more widely by practitioners. 

Item Cultural Bias in the Entry Level Test 

In order to collect baseline data, the California Assessment Program 
annually administers the Entry Level Test (ELT) to every first grader in 
California each September. At the time of the first administration of the 
ELT in September, 1973, it consisted of 36 items covering five subtests. 
Along with item data, the ethnic group of each child was reported. A one 
percent systematic sample of the state resulted in a file of 3,010 pupils 
available for analysis. The problem was to determine which, if any, of 
the test items contained cultural bias. 

The approach taken by the staff at that time was to run a factor 
analysis, considering responses to each of the items and membership in 
each of the ethnic groups to be a variable. The factor structure of the 
test itself was quite clean; most of the 36 items loaded into only one 
factor, and loaded jointly only with all the other items in their own 
subtest. The loadings of the ethnic groups were much less definitive, and 
it was not known if the problem was one of statistics, such as the restric- 
tion of range of interitem correlations when items are extremely easy 
(several items in the test had p-values greater than .9), or if the items 
in the tests were truly unbiased. The results of this analysis were reported 
by Lorrie Shepard at the 1975 NCME Annual Meeting in a paper entitled 
Developing the California Entry Level Test: Construct Validity of the 
Subtests. The opinion of the staff was that the amount of cultural bias 
in the test was unclear, although there was belief that it was relatively 
unbiased. 

The data were reanalyzed last spring using the Rasch model. Rasch 
item difficulties were computed separately for whites, blacks, and 
Spanish-surnamed children. Plots of Rasch item difficulties for whites versus 
blacks and whites versus Spanish-surnamed were constructed. These two 
plots are shown as Figures 1 and 2. Each plot demonstrated two distinct 
patterns of straight lines: one line consisting of the first six items on 
the test and one very difficult item from the end of the test, and a second 
line drawn from the remaining items. 



*Paper presented at the 1979 Annual Meeting of the National Council on 
Measurement in Education, San Francisco. 
The first six items should have been the most culture-fair items on the 
test. They all were classified as Immediate Recall, and consisted of the 
teacher showing the children pairs of objects, putting the pictures away, and 
then having the children choose the picture that matched the stimulus picture. 
If it is indeed true that these were culture-fair items, then it can be 
shown from the graphs that some cultural bias is contained in the vast 
majority of items in the ELT. 

To compute the amount of bias, the correlations and associated regression 
lines were computed for whites vs. blacks and whites vs. Spanish-surnamed for 
both the 7 extraordinary items and the remaining 29 items. The results are 
displayed in Table 1. 



                               Table 1 

               Correlations and Regression Equations 

                        Regression of          Regression of 
                        Blacks on Whites       Spanish-surnamed 
                                               on Whites 

Correlation 
  Items 1-6 and 34      .940                   .996 
  Remaining 29 items    .968                   .926 

Regression 
  Items 1-6 and 34      Y = .973X - 1.72       Y = .890X + 1.96 
  Remaining 29 items    Y = .925X + 4.40       Y = .902X + 5.77 



The amount of bias can be defined as the difference between the scores 
predicted from the "unbiased" regression line (the one computed on items 1-6 
and 34) and those predicted from the "biased" regression line. Using a 
typical score of 50, this translates into a bias of 3.72 points for the 
blacks and 4.39 points for the Spanish-surnamed. 
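
The bias arithmetic can be checked directly from the regression equations in 
Table 1. The sketch below is illustrative only. It assumes the equations are 
read row by row, so that Y = .973X - 1.72 and Y = .925X + 4.40 belong to the 
blacks and Y = .890X + 1.96 and Y = .902X + 5.77 to the Spanish-surnamed, and 
it assumes the X that is missing from the first equation as printed. The 
function and variable names are hypothetical. 

```python
def predict(slope, intercept, x):
    """Predicted item difficulty from a regression line Y = slope * X + intercept."""
    return slope * x + intercept


def bias(unbiased_line, biased_line, x=50.0):
    """Difference between the 'biased' and 'unbiased' predictions at score x."""
    return predict(*biased_line, x) - predict(*unbiased_line, x)


# (slope, intercept) pairs; the row-by-row assignment is an assumption.
BLACKS = {"unbiased": (0.973, -1.72), "biased": (0.925, 4.40)}
SPANISH_SURNAMED = {"unbiased": (0.890, 1.96), "biased": (0.902, 5.77)}

bias_blacks = bias(BLACKS["unbiased"], BLACKS["biased"])
bias_spanish = bias(SPANISH_SURNAMED["unbiased"], SPANISH_SURNAMED["biased"])

print(round(bias_blacks, 2), round(bias_spanish, 2))
```

Under this reading the sketch reproduces the 3.72 points reported for the 
blacks. For the Spanish-surnamed it gives about 4.41 rather than 4.39, a small 
difference that may reflect rounding of the printed coefficients. 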

Nonlinear Regression Analysis 

As part of the reporting system of the California Assessment Program, 
multiple linear regression analyses are done for each achievement test, 
relating a series of predictor variables to school mean achievement. While 
the several regression analyses done at most grades are linear, the third 
grade Reading Test analysis is not. A possible reason is that 
the Reading Test is quite easy for third graders, and the resulting ceiling 
effect produces nonlinear relationships between the predictor variables and 
third grade Reading Test scores. 

The solution to this problem for several years was to include second 
and third order moments of the principal predictor (ELT scores) in the 
regression equation. While this approach removed the nonlinearity, it 
clearly is a patchwork approach to a measurement problem. 



As a potential alternative, the regression was rerun using transformed 
Reading Test scores. The scores were transformed by taking the natural log 
odds ratio, ln(P/(100-P)). This transformation was made, rather than 
transforming to Rasch scaled scores, because a) the two transformations are 
highly similar and b) because of the matrix sampling procedures, no [illegible] 
method for obtaining a scaled ability estimate for a school has been agreed 
upon. 
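
A minimal sketch of the log odds transformation follows, assuming P is a 
percent correct score strictly between 0 and 100. The function name is 
illustrative, and the sample values are arbitrary rather than CAP data. 

```python
import math


def log_odds(percent_correct):
    """Natural log odds ratio ln(P / (100 - P)) for a percent correct score P."""
    if not 0.0 < percent_correct < 100.0:
        # The transformation is undefined at 0 and 100 percent.
        raise ValueError("percent_correct must lie strictly between 0 and 100")
    return math.log(percent_correct / (100.0 - percent_correct))


# Arbitrary illustrative values, not taken from the CAP files.
for p in (50, 55, 90, 95):
    print(p, round(log_odds(p), 3))
```

A five-point step near the ceiling (90 to 95) covers a wider interval on the 
log odds scale than the same step in the middle of the range (50 to 55), which 
is why the transformation can offset a ceiling effect. 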

The regressions run on the transformed data were far more linear than 
the regressions that had been run on the untransformed Reading Test scores. 
This can be seen from the results reported in Table 2. With the untransformed 
data, the square and cube of ELT deviations loaded before mobility scores, 
added a total of .649 percent to the variance accounted for, and had beta 
weights of -.18 and -.10, respectively. After the Reading Test scores had 
been transformed, the square and cube of ELT deviations were the last variables 
to be loaded, added only .107 percent to the variance accounted for, and 
had beta weights of just .03 and -.01, respectively. 

A negative finding that resulted from the transformation is that less 
variance is accounted for by the predictors (84.7% vs. 87.3% on the 
untransformed data). This may be occurring because some of the predictor 
variables need to be transformed as well. In particular, ELT, AFDC rate and the 
bilingualism rate all have skewed distributions. If they were transformed, 
it is possible that the regressions would become even more linear while at 
the same time increasing the variance accounted for. In support of this 
hypothesis, note that when the Reading Test results were transformed, the 
simple correlations between Reading Test scores and the two relatively 
symmetrical variables (socioeconomic status and mobility) increased slightly, 
while the correlations to the three variables with skewed distributions 
(ELT, AFDC rate and bilingualism rate) declined. 

Comparing School Growth from Grade Two to Grade Three 

The same Reading Test is given to all second and third graders 
in California. While the test is moderately difficult for second graders, 
it is quite easy for third graders. The ceiling effect that complicated 
the third grade regression analysis also presented problems when it was 
desired to examine growth from grade two to grade three. 

This issue arose when, under a contract from the California State 
Department of Education, staff at SRI International wanted to compare the 
changes of test scores from grade two to grade three of different types of 
classes in schools in California. By drawing ogives on linear/normal graph 
paper, it could be demonstrated that higher scoring schools exhibited less 
percentage growth on the Reading Test from grade two to grade three than did 
lower scoring schools. The ogives are shown as Figure 3. Since this result 
is contrary to all expectation (it is usually observed that higher scoring 
children exhibit more achievement growth, not less, when advancing to 
higher grades), it was assumed to be caused by the ceiling effect on the 
third grade Reading Test. 
                               Table 2 

               Results of Multiple Regression Analysis 
               on Third Grade Reading Test Scores 

Untransformed Reading Test Scores 

                         Multiple    R Square    Simple 
Variable                 R Square    Change      R              Beta 

ELT                      .67482      .67482      .821           .37258 
AFDC rate                .72899      .05417      -.693          -.24975 
Bilingualism rate        .74575      .01676      -.6337         -.12376 
Socioeconomic Status     .75516      .00940      .7584          .20835 
(ELT-27.321)**2          .75998      .00482      [illegible]    -.17973 
(ELT-27.321)**3          .76165      .00167      [illegible]    -.10481 
Mobility                 .76212      .00046      [illegible]    -.02228 

Transformed Reading Test Scores 

                         Multiple    R Square    Simple 
Variable                 R Square    Change      R              Beta 

ELT                      .62195      .62195      .78864         .39197 
Socioeconomic Status     .69673      .07478      .77847         .28059 
AFDC rate                .70937      .01264      -.66579        -.19341 
Bilingualism rate        .71429      .00492      -.60463        -.12412 
Mobility                 .71572      .00143      -.16646        -.03750 
(ELT-27.321)**2          .71677      .00104      -.43131        .03412 
(ELT-27.321)**3          .71679      .00003      .46035         -.01325 
The probability that the problem was due to ceiling effect also could be 
shown from Figure 3. If the distribution of scores was normal, the ogive 
on linear/normal paper would be a straight line. These ogives were fairly 
linear until school mean scores reached approximately 80 percent, and then 
began to curve away from the straight line. This indicates the distribution 
of school mean scores was negatively skewed, a likely outcome when 
ceiling effect is encountered. 
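
The straight-line check on linear/normal paper can be approximated 
numerically. The sketch below is not the procedure used for Figure 3; it 
assumes plotting positions of (i - 0.5)/n, measures linearity with a simple 
correlation, and runs on synthetic scores with a crude cap at 80 standing in 
for a ceiling. All names and data are hypothetical. 

```python
from statistics import NormalDist, mean


def normal_scores(n):
    """Normal deviates for n ordered points at plotting positions (i - 0.5) / n."""
    unit_normal = NormalDist()
    return [unit_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def ogive_linearity(school_means):
    """Correlation between ordered scores and their normal deviates.

    A value near 1 corresponds to a straight ogive on linear/normal paper.
    """
    ordered = sorted(school_means)
    return pearson_r(ordered, normal_scores(len(ordered)))


# Synthetic data only: a symmetric set of scores and the same set capped at 80.
symmetric = [65.0 + 10.0 * z for z in normal_scores(200)]
with_ceiling = [min(score, 80.0) for score in symmetric]

print(round(ogive_linearity(symmetric), 4), round(ogive_linearity(with_ceiling), 4))
```

The capped scores give a lower linearity value than the symmetric ones, the 
numerical counterpart of an ogive that bends away from the straight line. 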

The school mean scores were scaled by converting them to a log odds 
ratio, and the ogives redrawn. These ogives are shown in Figure 4. This 
time, the ogives were almost perfectly linear. The ogive for grade two was 
almost parallel to the third grade ogive, except that higher scoring schools 
showed slightly more growth from grade two to grade three than did lower 
scoring schools. This result was consonant with expectations, and it was 
concluded that the log odds ratio scaling presented a more accurate picture 
of growth from grade two to grade three than did percentage correct scores. 

Analysis of Growth in Content Areas 

After each year's testing, the Reading Test results are presented to 
an advisory committee for their review. The test has four major content 
areas (word identification, vocabulary, comprehension, and study-locational 
skills), and one question the committee posed was, "In which of the four 
areas is the growth the greatest from grade two to grade three?" The 
changes in percent correct scores were greatest for comprehension and 
vocabulary, and lowest for word identification and study-locational skills, but 
this result was confounded by the fact that word identification and 
study-locational items were the easiest on the test, and comprehension items the 
hardest. Thus, it was likely that ceiling problems were having differential 
effects on the results for the four content areas. 

Given the fact that reading scores, relative to national norms, declined 
in California after the third grade, the committee had expected to find growth 
to be poorest in comprehension. Since changes in percent correct scores were 
greatest in comprehension, the suspicion was that ceiling effect was 
confounding the interpretation. 

Table 3 shows the results presented to the advisory committee. It shows 
that gains from grade two to grade three are negatively correlated with the 
difficulty of the content area. 

To address this problem, Rasch difficulties were computed at each grade 
for the 250 items on the test. To do this, a two percent systematic sample 
of the students tested statewide was selected, and the analysis done on the 
samples of approximately 6,000 students per grade. The 250 items on the 
Reading Test had been divided into 10 parallel forms. Therefore, 10 analyses 
were done at each grade, each consisting of the responses of approximately 
600 students to 25 items each. After Rasch difficulties were computed for 
each item at each grade, they were summed across items for each content area 
to get a mean difficulty for each area. 
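
As a rough illustration of what an item difficulty on a log odds scale looks 
like, the sketch below computes a centered log odds value from each item's 
proportion correct. This is only a crude first approximation and is not the 
estimation procedure used for the analyses reported here, which is not 
described in detail. The function name and the sample proportions are 
hypothetical. 

```python
import math


def approximate_difficulties(proportions_correct):
    """Centered log odds of an incorrect response for each item.

    Harder items (lower proportion correct) receive larger values, and the
    values are centered so that the mean difficulty on a form is zero.
    """
    raw = [math.log((1.0 - p) / p) for p in proportions_correct]
    center = sum(raw) / len(raw)
    return [value - center for value in raw]


# Hypothetical proportions correct for five items on one form.
example = approximate_difficulties([0.92, 0.85, 0.70, 0.55, 0.40])
print([round(value, 2) for value in example])
```

Centering within each form puts the items of a form on a common relative 
scale, which is the sense in which an area can become "relatively" harder or 
easier from one grade to the next. 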
Figure 4. Ogive of California public school mean scores, converted 
to log odds ratio, on the Reading Test, Spring, 1975, administration 
Table 4 shows the results presented in terms of Rasch scaled scores. 
Since Rasch scaled scores increase as relative difficulty increases, a gain 
in scores from grade 2 to grade 3 means that an area became relatively more 
difficult. Therefore, the content area in which the most gain was made was 
Study-Locational Skills, the area which was rated third out of four on the 
basis of increased percentage of correct responses. Similarly, comprehension, 
which appeared to have the largest increase on the basis of percentage of 
correct responses, was third of the four content areas when rated on the basis 
of the Rasch scaled scores. 
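
The reversal in rank order can be illustrated from the percent correct values 
in Table 3 alone. The sketch below is not the Rasch analysis itself: as a 
stand-in, it applies the log odds ratio ln(P/(100-P)) to each content area's 
percent correct and compares the resulting gains with the raw percentage 
gains. The names are illustrative. 

```python
import math

# (grade 2, grade 3) percent correct by content area, from Table 3.
PERCENT_CORRECT = {
    "Word Identification": (75.4, 85.8),
    "Vocabulary": (67.7, 82.6),
    "Comprehension": (61.3, 77.0),
    "Study-Locational Skills": (75.5, 88.0),
}


def log_odds(percent_correct):
    """Natural log odds ratio ln(P / (100 - P))."""
    return math.log(percent_correct / (100.0 - percent_correct))


def rank_areas(gain):
    """Content areas ordered from largest to smallest gain."""
    return sorted(
        PERCENT_CORRECT,
        key=lambda area: gain(*PERCENT_CORRECT[area]),
        reverse=True,
    )


percent_rank = rank_areas(lambda g2, g3: g3 - g2)
log_odds_rank = rank_areas(lambda g2, g3: log_odds(g3) - log_odds(g2))

print(percent_rank)
print(log_odds_rank)
```

On the percentage scale comprehension ranks first and study-locational skills 
third; on the log odds scale study-locational skills ranks first and 
comprehension third, the same ordering that the Rasch scaled differences in 
Table 4 imply. 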



                               Table 3 

          Results for the Reading Test, 1975 (Percent Correct) 

Content Area               Grade 2     Grade 3     Difference 

Word Identification        75.4        85.8        10.4 
Vocabulary                 67.7        82.6        14.9 
Comprehension              61.3        77.0        15.7 
Study-Locational Skills    75.5        88.0        12.5 
Total Test                 67.6        81.3        13.7 



                               Table 4 

   Results for the Reading Test, 1975, Reported by Rasch Scaled Scores 

                                                               Correlation 
                                                               Between Grade 2 
                                                               and Grade 3 Scores, 
Content Area               Grade 2     Grade 3     Difference  Computed over Items 

Word Identification        47.18       47.53       +.35        .97 
Vocabulary                 50.18       49.90       -.28        .89 
Comprehension              51.91       51.97       +.06        .95 
Study-Locational Skills    47.43       46.88       -.55        .93 
Total Test                 50.00       50.00        .00        .95 






To avoid overinterpretation of these results, a simple one-way analysis 
of variance was run on the data. (While a multivariate or repeated measures 
design might have been more appropriate and more powerful, the cost of analyzing 
the data in so complex a manner was not judged to be worth the return. The 
analysis of gain scores using four groups of items was judged sufficient to 
provide a ballpark figure concerning statistical significance.) The data 
produced an F-ratio of 3.71, with 3 and 246 degrees of freedom, which is 
significant at the .05 level, but not at the .01 level. However, no 
pairwise contrast was significant. Consequently, 
it seems safe to assume that if growth from grade two to grade three is greater 
in some content areas than others, those differential gains are so small that 
they are not readily detectable with current CAP data. 
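
For readers who wish to see how such an F-ratio and its degrees of freedom 
arise, a minimal sketch of a one-way analysis of variance follows. With four 
content areas and 250 items, the degrees of freedom are 4 - 1 = 3 and 
250 - 4 = 246, as reported above. The item-level gain scores are not 
reproduced in this paper, so the data below are a toy example and the function 
name is hypothetical. 

```python
def one_way_anova(groups):
    """Return (F ratio, between-groups df, within-groups df) for a list of groups."""
    n_total = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n_total
    group_means = [sum(group) / len(group) for group in groups]

    # Between-groups sum of squares: group size times squared mean deviation.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    # Within-groups sum of squares: squared deviations from each group's own mean.
    ss_within = sum(
        (x - m) ** 2 for group, m in zip(groups, group_means) for x in group
    )

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Toy data only; each inner list stands for the gain scores of one group of items.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(one_way_anova(toy_groups))
```

The resulting F-ratio is then compared with the tabled critical value for the 
same pair of degrees of freedom. 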

Summary 

The Rasch model has been around for over a decade now, but practical 
applications of it still are in their infancy. This paper demonstrates that 
the average measurement practitioner needs to be made more aware of its 
potential uses and power. Four problems faced by the California Assessment 
Program that were either solved in makeshift fashion or left completely 
unresolved were solved simply and straightforwardly by application of the 
Rasch model. If Rasch scaling can be used this effectively, it is important 
that more "frontliners" be instructed in its use. 






